Abstract

We consider the problem of training probabilistic conditional random fields (CRFs) in the context of a task where
performance is measured using a specific loss function. While maximum likelihood is the most common approach to
training CRFs, it ignores the inherent structure of the task's loss function. We describe alternatives to maximum likelihood
which take that loss into account. These include a novel adaptation of a loss upper bound from the structured SVM
literature to the CRF context, as well as a new loss-inspired KL divergence objective which relies on the probabilistic
nature of CRFs. These loss-sensitive objectives are compared to maximum likelihood using ranking as a benchmark task.
This comparison confirms the importance of incorporating loss information in the probabilistic training of CRFs, with the
loss-inspired KL outperforming all other objectives.

1 Introduction

Conditional random fields (CRFs) [1] form a flexible family of models for capturing the interaction between an input x and
a target y. CRFs have been designed for a vast variety of problems, including natural language processing [2, 3, 4], speech
processing [5], computer vision [6, 7, 8] and bioinformatics [9, 10] tasks. One reason for their popularity is that they
provide a flexible framework for modeling the conditional distributions of targets constrained by some specific structure,
such as chains [1], trees [11], 2D grids [7, 12], permutations [13] and many more.

While there has been a lot of work on developing appropriate CRF potentials and energy functions, as well as on
deriving efficient (approximate) inference procedures for some given target structure, much less attention has been paid
to the loss function under which the CRF's performance is ultimately evaluated. Indeed, CRFs are usually trained by
maximum likelihood (ML) or the maximum a posteriori criterion (MAP or regularized ML), which ignores the task's loss
function. Yet, several tasks are associated with loss functions^1 that are also structured and do not correspond to a simple
0/1 loss: labelwise error (Hamming loss) for item labeling, BLEU score for machine translation, normalized discounted
cumulative gain (NDCG) for ranking, etc. Ignoring this structure can prove as detrimental to performance as ignoring the
target's structure.

The inclusion of loss information into learning is an idea that has been more widely explored in the context of structured
support vector machines (SSVMs) [14, 15]. SSVMs and CRFs are closely related models, both trying to shape an
energy or score function over the joint input and target space to fit the available training data. However, while an SSVM
attempts to satisfy margin constraints without invoking a probabilistic interpretation of the model, a CRF follows a probabilistic
approach and instead aims at calibrating its probability estimates to the data. Similarly, while an SSVM relies
on maximization procedures to identify the most violated margin constraints, a CRF relies on (approximate) inference or
sampling procedures to estimate probabilities under its distribution and compare them to the empirical distribution.

^1 Without loss of generality, for tasks where a performance measure is instead provided (i.e., where higher values are better), we assume it can be
converted into a loss, e.g., by setting the loss to the negative of the performance measure.

While there are no obvious reasons to prefer one approach to the other, a currently unanswered question is whether the 
known methods that adapt SSVM training to some given loss (i.e., upper bounds based on margin and slack scaling [15]) 
can also be applied to the probabilistic training of CRFs. Another question is how such methods would compare to other 
loss-sensitive training objectives which rely on the probabilistic nature of CRFs and which may have no analog in the 
SSVM framework. 

We investigate these questions in this paper. First, we describe upper bounds similar to the margin and slack scaling
upper bounds of SSVMs, but that correspond to maximum likelihood training of CRFs with loss-augmented and loss-scaled
energy functions. Second, we describe two other loss-inspired training objectives for CRFs which rely on the probabilistic 
nature of CRFs: the standard average expected loss objective and a novel loss-inspired KL-divergence objective. Finally, 
we compare these loss-sensitive objectives on ranking benchmarks based on the NDCG performance measure. To our 
knowledge, this is the first systematic evaluation of loss-sensitive training objectives for probabilistic CRFs. 

2 Conditional Random Fields 

This work is concerned with the general problem of supervised learning, where the relationship between an input x and 
a target y must be learned from a training set of instantiated pairs $\mathcal{D} = \{(x_t, y_t)\}$. More specifically, we are interested in
learning a predictive mapping from x to y. 

Conditional random fields (CRFs) tackle this problem by directly defining the conditional distribution $p(y|x)$ through
some energy function $E(y, x; \theta)$ as follows:

$$p(y|x) = \exp(-E(y, x; \theta)) / Z(x), \qquad Z(x) = \sum_{y \in \mathcal{Y}(x)} \exp(-E(y, x; \theta)) \qquad (1)$$

where $\mathcal{Y}(x)$ is the set of all possible configurations for y given the input x and $\theta$ is the model's parameter vector. The
parametric form of the energy function $E(y, x; \theta)$ will depend on the nature of x and y. A popular choice is that of a linear
function of a set of features on x and y, i.e., $E(y, x; \theta) = -\sum_i \theta_i f_i(x, y)$.
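
As an illustration of Equation 1 with a linear energy, the following minimal sketch computes $p(y|x)$ by brute-force enumeration over a tiny configuration space. The binary chain of length 3, the two features in `features` and the parameter values are hypothetical choices made only for this example; they are not taken from the paper.

```python
import itertools
import math

def features(x, y):
    """Toy feature vector f(x, y) for a binary chain (illustrative assumption)."""
    unary = sum(xj * (2 * yj - 1) for xj, yj in zip(x, y))
    pairwise = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [unary, pairwise]

def energy(y, x, theta):
    """Linear energy E(y, x; theta) = -sum_i theta_i * f_i(x, y)."""
    return -sum(t * f for t, f in zip(theta, features(x, y)))

def crf_distribution(x, theta):
    """Return {y: p(y | x)} by enumerating every configuration in Y(x)."""
    configs = list(itertools.product([0, 1], repeat=len(x)))
    energies = [energy(y, x, theta) for y in configs]
    e_min = min(energies)  # constant shift for stability; it cancels in the ratio
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)
    return {y: w / z for y, w in zip(configs, weights)}

x = [0.5, -1.0, 2.0]   # hypothetical input
theta = [1.0, 0.3]     # hypothetical parameters
p = crf_distribution(x, theta)
best = max(p, key=p.get)  # most probable configuration = lowest energy
print(best, round(p[best], 4))
```

Enumeration is only feasible for very small output spaces; for chains and trees the same quantities would be obtained with belief propagation.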

2.1 Maximum Likelihood Objective 

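The most popular approach to training CRFs is conditional maximum likelihood. It corresponds to the minimization with
respect to $\theta$ of the objective $\mathcal{L}_{\mathrm{ML}}(\mathcal{D}; \theta)$:

$$-\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \log p(y_t|x_t) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \left[ E(y_t, x_t; \theta) + \log \sum_{y \in \mathcal{Y}(x_t)} \exp(-E(y, x_t; \theta)) \right]. \qquad (2)$$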

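To this end, one can use any gradient-based optimization procedure, which can converge to a local optimum, or even a
global optimum if the problem is convex (e.g., by choosing an energy function $E(y, x; \theta)$ linear in $\theta$). The gradients have
an elegant form:

$$\frac{\partial \mathcal{L}_{\mathrm{ML}}(\mathcal{D}; \theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \left( \frac{\partial E(y_t, x_t; \theta)}{\partial \theta} - \mathbb{E}_{y|x_t}\left[ \frac{\partial E(y, x_t; \theta)}{\partial \theta} \right] \right). \qquad (3)$$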



Hence exact gradient evaluations are possible when the conditional expectation in the second term is tractable. This is the
case for CRFs with a chain or tree structure, for which belief propagation can be used. When gradients are intractable, two
approximate alternatives can be considered. The first is to approximate the intractable expectation using either Markov
chain Monte Carlo sampling or variational inference algorithms such as mean-field or loopy belief propagation, the latter
being the most popular. The second approach is to use alternative objectives such as pseudolikelihood [16] or piece-wise
training [17]^2.
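
The maximum likelihood objective of Equation 2 and its gradient (target term minus model expectation) can be sketched for a single training example as follows. The sketch assumes a linear energy and a small set of configurations listed explicitly through their feature vectors; the feature values, parameters and target index are hypothetical. A central finite-difference estimate is computed as an independent check of the analytic gradient.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def nll_and_grad(theta, feats, target):
    """Negative log-likelihood of one example and its gradient w.r.t. theta.

    feats[k] is the feature vector of configuration k and the energy is
    linear: E_k = -theta . feats[k].
    """
    energies = [-sum(t * f for t, f in zip(theta, fk)) for fk in feats]
    log_z = log_sum_exp([-e for e in energies])
    probs = [math.exp(-e - log_z) for e in energies]
    nll = energies[target] + log_z
    # dE_k/dtheta = -feats[k], hence
    # dE_target/dtheta - E_p[dE/dtheta] = -f_target + E_p[f].
    grad = [
        -feats[target][i] + sum(p * fk[i] for p, fk in zip(probs, feats))
        for i in range(len(theta))
    ]
    return nll, grad

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]  # hypothetical
theta = [0.2, -0.4]                                        # hypothetical
nll, grad = nll_and_grad(theta, feats, target=2)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
fd = []
for i in range(len(theta)):
    up = list(theta)
    up[i] += eps
    dn = list(theta)
    dn[i] -= eps
    fd.append((nll_and_grad(up, feats, 2)[0] - nll_and_grad(dn, feats, 2)[0]) / (2 * eps))
print(grad, fd)
```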

3 Loss-sensitive Training Objectives 

Unfortunately, maximum likelihood and its associated approximations all suffer from the problem that the loss function 
under which the performance of the CRF is evaluated is ignored. In the well-specified case and for large datasets, this 
would probably not be a problem because of the asymptotic consistency and efficiency properties of maximum likelihood. 
However, almost all practical problems do not fall in the well-specified setting, which justifies the exploration of alternative
training objectives. 

Let $y(x_t)$ denote the prediction made by a CRF for some given input $x_t$. Most commonly^3, this prediction will be
$y(x_t) = \arg\max_{y \in \mathcal{Y}(x_t)} p(y|x_t) = \arg\min_{y \in \mathcal{Y}(x_t)} E(y, x_t; \theta)$. We assume that we are given some loss $l_t(y(x_t))$ under
which the performance of the CRF on some dataset $\mathcal{D}$ will be measured. We will also assume that $l_t(y_t) = 0$. The goal is
then to achieve a low average $\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(y(x_t))$ under that loss.

Directly minimizing this average loss is hard, because $l_t(y(x_t))$ is not a smooth function of the CRF parameters $\theta$.
In fact, the loss itself $l_t(y(x_t))$ is normally not a smooth function of the prediction $y(x_t)$, and $y(x_t)$ is also not a smooth
function of the model parameters $\theta$. This non-smoothness makes it impossible to apply gradient-based optimization.

However, one could attempt to indirectly optimize the average loss by deriving smooth objectives that also depend on 
the loss. In the next sections, we describe three separate formulations of this approach. 

3.1 Loss Upper Bounds 

The loss function provides important information as to how good a potential prediction y is with respect to the ground truth
$y_t$. In particular, it specifies an ordering from the best prediction ($y = y_t$) to increasingly bad predictions with increasing
value of their associated loss $l_t(y)$. It might then be desirable to ensure that the CRF assigns particularly low probability
(i.e., high energy) to the worst possible predictions, as measured by the loss.

A first way of achieving this is to augment the energy function at a given training example $(x_t, y_t)$ by including the
loss function for that example, producing a Loss-Augmented energy:

$$E_t^{\mathrm{LA}}(y, x_t; \theta) = E(y, x_t; \theta) - l_t(y). \qquad (4)$$

Artificially reducing the energy of bad values of y as a function of their loss forces the CRF to increase
the value of $E(y, x; \theta)$ even further for those values of y with high loss. This idea is similar to the concept of margin
re-scaling in structured support vector machines (SSVMs) [14, 15], a similarity that has been highlighted previously by
Hazan and Urtasun [18]. Moreover, as in SSVMs, it can be shown that by replacing the regular energy function with this
loss-augmented energy function in the maximum likelihood objective of Equation 2, we obtain a new Loss-Augmented
objective that upper bounds the average loss:

$$\mathcal{L}_{\mathrm{LA}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \left[ E(y_t, x_t; \theta) - l_t(y_t) + \log \sum_{y \in \mathcal{Y}(x_t)} \exp(-E(y, x_t; \theta) + l_t(y)) \right]$$
$$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \left[ E(y_t, x_t; \theta) - E(y(x_t), x_t; \theta) + l_t(y(x_t)) - l_t(y_t) \right]$$
$$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(y(x_t)).$$

The first inequality keeps only the term $y = y(x_t)$ of the sum inside the logarithm; the second uses $l_t(y_t) = 0$ and the fact that $y(x_t)$ minimizes the energy.

^2 Variational inference-based training can also be interpreted as training based on a different objective.

^3 For loss functions that decompose into loss terms over subsets of target variables, it may be more appropriate to use the mode of the marginals over
each subset as the prediction.

We see that the higher $l_t(y)$ is for some given y, the more important the energy term associated with it will be in the global
objective. Hence, introducing the loss this way will indeed force the optimization to focus more on increasing the energy
for configurations of y associated with high loss.
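
The upper-bound property of the loss-augmented objective can be checked numerically on a single example. The sketch below is only illustrative: it treats the energies of five configurations as given numbers, borrowing the energies and losses used in Figure 1, with the middle configuration as the ground truth (zero loss).

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def loss_augmented_nll(energies, losses, target):
    """-log p_LA(y_t | x_t) for one example, with E_LA(y) = E(y) - l_t(y)."""
    augmented = [e - l for e, l in zip(energies, losses)]
    return augmented[target] + log_sum_exp([-a for a in augmented])

energies = [-1.0, -0.5, 0.0, 0.5, 1.0]
losses = [5.0, 1.0, 0.0, 1.0, 5.0]
target = 2  # index of the zero-loss (ground-truth) configuration

# The prediction is the configuration with the lowest (unaugmented) energy.
pred = min(range(len(energies)), key=energies.__getitem__)
objective = loss_augmented_nll(energies, losses, target)
task_loss = losses[pred]
print(objective >= task_loss)
```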

As an alternative to subtracting the loss, we could further increase the weight of terms associated with high loss by also
multiplying the original energy function, as follows:

$$E_t^{\mathrm{LS}}(y, x_t; \theta) = l_t(y) \left( E(y, x_t; \theta) - E(y_t, x_t; \theta) \right) - l_t(y). \qquad (5)$$

The advantage of this Loss-Scaled energy is that when a configuration of y with high loss already has higher energy than
the target (i.e., $E(y, x_t; \theta) - E(y_t, x_t; \theta) > 0$), then the energy is going to be further increased, reducing its weight in the
optimization. In other words, focus in the optimization is put on bad configurations of y only when they have lower energy
than the target. Finally, we can also show that the Loss-Scaled objective obtained from this loss-scaled energy leads to an
upper bound on the average loss:



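$$\mathcal{L}_{\mathrm{LS}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \log \left( \sum_{y \in \mathcal{Y}(x_t)} \exp\left( -l_t(y) \left( E(y, x_t; \theta) - E(y_t, x_t; \theta) \right) + l_t(y) \right) \right)$$
$$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(y(x_t)) \left( 1 + E(y_t, x_t; \theta) - E(y(x_t), x_t; \theta) \right)$$
$$\geq \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} l_t(y(x_t)).$$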

There is a connection with SSVM training objectives here as well. This loss-scaled CRF is the probabilistic equivalent of 
SSVM training with slack re-scaling [15]. 
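
The same kind of numerical check can be sketched for the loss-scaled energy of Equation 5. As before, the five energies and losses are those used in Figure 1 and are treated as given numbers, with the middle configuration as the zero-loss ground truth; this is an illustration under those assumptions rather than a training procedure.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def loss_scaled_nll(energies, losses, target):
    """-log p_LS(y_t | x_t), with E_LS(y) = l_t(y) * (E(y) - E(y_t)) - l_t(y)."""
    e_t = energies[target]
    scaled = [l * (e - e_t) - l for e, l in zip(energies, losses)]
    # scaled[target] is zero whenever the target has zero loss.
    return scaled[target] + log_sum_exp([-s for s in scaled])

energies = [-1.0, -0.5, 0.0, 0.5, 1.0]
losses = [5.0, 1.0, 0.0, 1.0, 5.0]
target = 2

pred = min(range(len(energies)), key=energies.__getitem__)
objective = loss_scaled_nll(energies, losses, target)
# Intermediate bound: l_t(pred) * (1 + E(y_t) - E(pred)).
middle = losses[pred] * (1.0 + energies[target] - energies[pred])
print(objective >= middle >= losses[pred])
```

On this example the high-loss configuration that already has lower energy than the target dominates the sum, which is the "worst violator" behavior the loss scaling is designed to emphasize.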

Since both the loss-augmented and loss-scaled CRF objectives follow the general form of the maximum likelihood
objective but with different energy functions, the form of the gradient is also that of Equation 3. The two key differences
are that the energy function is now different, and that the conditional expectation on y given $x_t$ is taken according to the CRF
distribution with the associated loss-sensitive energy. In general (particularly for the loss-scaled CRF), it will not be
possible to run belief propagation to compute the expectation^4, but adapted forms of loopy belief propagation or MCMC
(e.g., Gibbs sampling) could be used.

3.2 Expected Loss 

A second approach to deriving a smooth version of the average loss is to optimize the average Expected Loss, where the 
expectation is based on the CRF's distribution: 

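$$\mathcal{L}_{\mathrm{EL}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \mathbb{E}_{y|x_t}\left[ l_t(y) \right] = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \sum_{y \in \mathcal{Y}(x_t)} l_t(y) \, p(y|x_t). \qquad (6)$$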

While this objective is not an upper bound, it gets increasingly close to the average loss as the entropy of $p(y|x_t)$
becomes smaller and the distribution puts all its mass on $y(x_t)$.
The parameter gradient has the following form: 



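$$\frac{\partial \mathcal{L}_{\mathrm{EL}}(\mathcal{D}; \theta)}{\partial \theta} = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \left( \mathbb{E}_{y|x_t}\left[ l_t(y) \right] \mathbb{E}_{y|x_t}\left[ \frac{\partial E(y, x_t; \theta)}{\partial \theta} \right] - \mathbb{E}_{y|x_t}\left[ l_t(y) \frac{\partial E(y, x_t; \theta)}{\partial \theta} \right] \right). \qquad (7)$$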



If the required expectations cannot be computed tractably, MCMC sampling can be used to approximate them. Another 
alternative is to use a fixed set of representative samples [13]. 
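
For a single example with a linear energy, the expected loss of Equation 6 and its parameter gradient can be sketched as below. The gradient is written as the covariance-style expression that follows from differentiating $\sum_y l_t(y) p(y|x_t)$ with $p(y|x_t) \propto \exp(-E(y, x_t; \theta))$; the feature vectors, losses and parameters are hypothetical, and a finite-difference estimate serves as an independent check.

```python
import math

def expected_loss_and_grad(theta, feats, losses):
    """Expected loss under the CRF and its gradient w.r.t. theta (one example).

    feats[k] is the feature vector of configuration k, with linear energy
    E_k = -theta . feats[k], so that p_k is proportional to exp(theta . feats[k]).
    """
    scores = [sum(t * f for t, f in zip(theta, fk)) for fk in feats]  # -E_k
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    probs = [w / z for w in weights]
    exp_loss = sum(p * l for p, l in zip(probs, losses))
    grad = []
    for i in range(len(theta)):
        mean_f = sum(p * fk[i] for p, fk in zip(probs, feats))
        mean_lf = sum(p * l * fk[i] for p, l, fk in zip(probs, losses, feats))
        # With dE/dtheta_i = -f_i:
        # E[l] E[dE/dtheta_i] - E[l dE/dtheta_i] = E[l f_i] - E[l] E[f_i].
        grad.append(mean_lf - exp_loss * mean_f)
    return exp_loss, grad

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]  # hypothetical
losses = [2.0, 1.0, 0.0, 3.0]                              # hypothetical
theta = [0.2, -0.4]                                        # hypothetical
exp_loss, grad = expected_loss_and_grad(theta, feats, losses)
print(exp_loss, grad)
```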

3.3 Loss-inspired Kullback-Leibler 

Both the average expected loss and the loss upper bound objectives have in common that their objectives are perfectly
minimized when the posteriors $p(y|x_t)$ put all their mass on the targets $y_t$. In practice, this is bound not to happen, since
this is likely to correspond to an overfitted solution which will be avoided using additional regularization.

Instead of relying on a generic regularizer such as the $\ell_2$-norm of the parameter vector, perhaps the loss function itself
might provide cues as to how best to regularize the CRF. Indeed, we can think of the loss as a ranking of all potential
predictions, from perfect to adequate to worse. Hence, if we are not to put all probability mass on $p(y_t|x_t)$, we could make
use of the information provided by the loss in order to determine how to distribute the excess mass $1 - p(y_t|x_t)$ on other
configurations of y. In particular, it would be sensible to distribute it on other values of y according to their loss $l_t(y)$, with lower-loss configurations receiving more mass.

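To achieve this, we propose to first convert the loss into a distribution over the target $q(y|t)$ and then minimize the
Kullback-Leibler (KL) divergence between this target distribution and the CRF posterior:

$$\mathcal{L}_{\mathrm{KL}}(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} D_{\mathrm{KL}}\left( q(\cdot|t) \,\|\, p(\cdot|x_t) \right) = -\frac{1}{|\mathcal{D}|} \sum_{(x_t, y_t) \in \mathcal{D}} \sum_{y \in \mathcal{Y}(x_t)} q(y|t) \log p(y|x_t) - C \qquad (8)$$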

where the constant $C = H(q(\cdot|t))$ is the entropy of the target distribution, which does not depend on the parameter vector $\theta$.
There are several ways of defining the target distribution $q(y|t)$. In this work, we define it as follows:

$$q(y|t) = \exp(-l_t(y)/T) / Z_t, \qquad Z_t = \sum_{y \in \mathcal{Y}(x_t)} \exp(-l_t(y)/T) \qquad (9)$$

where the temperature parameter T controls how peaked this distribution is around $y_t$. The maximum likelihood objective
is recovered as T approaches 0.

^4 In the loss-augmented case, one exception is if the loss decomposes into individual losses over each target variable $y_i$ and the CRF follows a tree
structure in its output. In this case, the loss terms can be integrated into the CRF unary features and belief propagation will perform exact inference.

Figure 1: Negative derivatives of the objective with respect to energy for each of the five training objectives presented. The
five training objectives are: maximum likelihood (ML), loss-augmented ML (LA), loss-scaled ML (LS), expected loss (EL)
and loss-inspired Kullback-Leibler divergence (KL). For each objective we consider five different configurations: from left
to right, their energies are [-1, -0.5, 0, 0.5, 1] and the losses are [5, 1, 0, 1, 5]. The middle one therefore corresponds to
a ground-truth configuration; those to its left are currently more likely under the model, and loss increases with distance
from this middle one. The derivatives for each objective are normalized by their norm.

The gradient with respect to $\theta$ is simply the expectation of the gradient for maximum likelihood $\mathcal{L}_{\mathrm{ML}}$ according to
the target distribution $q(y|t)$. Here too, if the expectation is not tractable, one can use sampling to approximate it. In
particular, since we have total control over the form of $q(y|t)$, it is easy to define it such that it can be sampled from exactly.
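
The role of the temperature in Equation 9 can be illustrated with a minimal sketch that builds the target distribution from the losses and evaluates the KL divergence to the CRF distribution for one example. The five energies and losses are those of Figure 1, treated as given numbers; the temperature values are arbitrary choices for the illustration. With a very small temperature the target distribution collapses onto the zero-loss configuration and the objective coincides with the negative log-likelihood.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def target_distribution(losses, temperature):
    """q(y|t) proportional to exp(-l_t(y) / T)."""
    m = min(losses)
    weights = [math.exp(-(l - m) / temperature) for l in losses]
    z = sum(weights)
    return [w / z for w in weights]

def kl_objective(energies, losses, temperature):
    """KL(q || p) for one example; p is the CRF distribution over the listed configurations."""
    q = target_distribution(losses, temperature)
    log_z = log_sum_exp([-e for e in energies])
    log_p = [-e - log_z for e in energies]
    # Terms with q = 0 contribute nothing (0 * log 0 is taken to be 0).
    return sum(qi * (math.log(qi) - lp) for qi, lp in zip(q, log_p) if qi > 0.0)

energies = [-1.0, -0.5, 0.0, 0.5, 1.0]
losses = [5.0, 1.0, 0.0, 1.0, 5.0]
nll = energies[2] + log_sum_exp([-e for e in energies])  # maximum likelihood term

for temperature in (10.0, 1.0, 1e-3):
    print(temperature, kl_objective(energies, losses, temperature))
print("negative log-likelihood:", nll)
```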



3.4 Analyzing the Behavior of the Training Objectives 

Figure 1 shows how the gradient with respect to the energy changes for each objective as we consider configurations y with 
varying energy and loss values. From this figure we see significant differences in the behaviors of the introduced objective 
functions. Only the expected-loss and loss-inspired Kullback-Leibler objectives will attempt to lower the energies of 
configurations that have non-zero loss. The maximum likelihood objective aims to raise the energies of the non-zero loss 
configurations, in proportion to how probable they are. On the other hand the loss-augmented and loss-scaled objectives 
concentrate on the most probable configurations that have the highest loss (worst violators), with the loss-scaled objective 
having the most extreme behavior and putting all the gradient on the worst violator. This behavior is expected as the 
energies get amplified by the addition (multiplication) of the loss which artificially raises the probability of the already 
probable violators. 

The behavior of the expected-loss objective is counter-intuitive, as it tries to lower the energy of all configurations that
have low loss, including those that are already more probable than the zero-loss one. In this example, it even pushes down
the energy of a non-zero loss configuration more than that of the zero-loss (target) configuration. The loss-inspired
KL objective adjusts this and only lowers the energy of the zero-loss (ground-truth) configuration and of the low-loss configuration that has
low probability.
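
The qualitative behavior discussed in this section can be explored with a short sketch that evaluates, for one example, the negative derivative of each objective with respect to the energy of each configuration. The expressions are derived by treating the energies as free parameters in the per-example objectives of Sections 2.1 and 3. The temperature T = 1 for the KL target distribution is an assumption (the value used for Figure 1 is not stated), and no normalization is applied.

```python
import math

def gibbs(energies):
    """Distribution proportional to exp(-energy)."""
    m = min(energies)
    weights = [math.exp(-(e - m)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def negative_derivatives(energies, losses, target, temperature=1.0):
    """Return {objective name: [-dL/dE_k for each configuration k]}."""
    n = len(energies)
    onehot = [1.0 if k == target else 0.0 for k in range(n)]
    e_t = energies[target]
    p = gibbs(energies)                                               # CRF distribution
    p_la = gibbs([e - l for e, l in zip(energies, losses)])           # loss-augmented
    p_ls = gibbs([l * (e - e_t) - l for e, l in zip(energies, losses)])  # loss-scaled
    q = gibbs([l / temperature for l in losses])                      # KL target, Eq. 9
    mean_loss = sum(pk * lk for pk, lk in zip(p, losses))
    ls_mean_loss = sum(pk * lk for pk, lk in zip(p_ls, losses))
    return {
        "ML": [pk - ok for pk, ok in zip(p, onehot)],
        "LA": [pk - ok for pk, ok in zip(p_la, onehot)],
        "LS": [lk * pk - ok * ls_mean_loss for lk, pk, ok in zip(losses, p_ls, onehot)],
        "EL": [pk * (lk - mean_loss) for pk, lk in zip(p, losses)],
        "KL": [pk - qk for pk, qk in zip(p, q)],
    }

energies = [-1.0, -0.5, 0.0, 0.5, 1.0]
losses = [5.0, 1.0, 0.0, 1.0, 5.0]
derivs = negative_derivatives(energies, losses, target=2)
for name, values in derivs.items():
    print(name, [round(v, 3) for v in values])
```

A positive entry means that a gradient descent step raises the energy of that configuration, and a negative entry means that it lowers it.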

4 Learning with Multiple Ground Truths 

In certain applications, for some given input $x_t$, there is not only a single target $y_t$ that is correct (see Section 6 for the
case of ranking). This information can easily be encoded within the loss function, by setting $l_t(y) = 0$ for all such valid
predictions.

In this context, maximum likelihood training corresponds to the objective: 

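$$\mathcal{L}_{\mathrm{ML}}(\mathcal{D}; \theta) = -\frac{1}{|\mathcal{D}|} \sum_{x_t \in \mathcal{D}} \sum_{y_t \in \mathcal{Y}_0(x_t)} \log p(y_t|x_t) \qquad (10)$$

where $\mathcal{Y}_0(x_t) = \{ y \mid y \in \mathcal{Y}(x_t), l_t(y) = 0 \}$. This is equivalent to maximizing the likelihood of all predictions y that are
consistent with the loss, i.e., that have zero loss. The loss-augmented variant is similarly adjusted. As for loss-scaling, we
replace the energy at the ground truth with the average energy of all valid ground truths in the loss-scaled energy:

$$E_t^{\mathrm{LS}}(y, x_t; \theta) = l_t(y) \left( E(y, x_t; \theta) - \frac{1}{|\mathcal{Y}_0(x_t)|} \sum_{y_t \in \mathcal{Y}_0(x_t)} E(y_t, x_t; \theta) \right) - l_t(y). \qquad (11)$$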

No changes to the average expected loss and loss-inspired KL objectives are necessary as they consider all valid y. 

In the setting of multiple ground truths, a clear distinction can be made between the average expected loss and the
other objectives, in terms of the solutions they encourage. Indeed, the expected loss will be minimized as long as
$\sum_{y_t \in \mathcal{Y}_0(x_t)} p(y_t|x_t) = 1$, i.e., probability mass is only put on configurations of y that have zero loss. On the other
hand, the maximum likelihood and loss upper bound objectives add the requirement that the mass be equally distributed
amongst those configurations. As for the loss-inspired KL, it requires that the probability mass on those configurations sum to a constant
smaller than 1, specifically $1 - \sum_{y \in \mathcal{Y}(x_t) \setminus \mathcal{Y}_0(x_t)} q(y|t)$.

5 Related Work 

While maximum likelihood is the dominant approach to training CRFs in the literature, others have proposed ways of 
adapting the CRF training objective for specific tasks. For sequence labeling problems, Kakade et al. [19] proposed to 
maximize the label-wise marginal likelihood instead of the joint label sequence likelihood, to reflect the fact that the task's 
loss function is the sum of label-wise classification errors. Suzuki et al. [20] and Gross et al. [21] went a step further by
proposing to directly optimize a smoothed version of the label-wise classification error (Suzuki et al. [20] also described
how to apply it to optimize an F-score). Their approach is similar to the average expected loss described in Section 3.2;
however, they do not discuss how to generalize it to arbitrary loss functions. The average expected loss objective for CRFs
was formulated by Taylor et al. [22] and Volkovs and Zemel [13], in the context of ranking. 

Work in frameworks other than CRFs for structured output prediction has also looked at how to incorporate loss information
into learning. Tsochantaridis et al. [15] describe how to upper bound the average loss with margin and slack scaling.
McAllester et al. [23] propose a perceptron-like algorithm based on an update which in expectation is close to the gradient
on the true expected loss (i.e., the expectation is with respect to the true generative process). Both SSVMs and perceptron
algorithms require procedures for computing the so-called loss-adjusted MAP assignment of the output y which, for richly
structured losses, can be intractable. One advantage of CRFs is that they can instead leverage the vast MCMC literature
to sample from CRFs with loss-adjusted energies. Moreover, they open the door to alternative (i.e., not necessarily
upper-bounding) objectives.

Finally, while Hazan and Urtasun [18] described how margin scaling can be applied to CRFs, we give for the first time 
the equivalent of slack scaling for CRFs in Section 3.1. 

6 Experiments 

We evaluate the usefulness of the different loss-sensitive training objectives on ranking tasks. In this setting, the input
$x = (q, D)$ corresponds to a pair made of a query vector q and a set of documents $D = \{d^{(i)}\}$, and y is a vector
corresponding to a ranking^5 of each document $d^{(i)}$ among the whole set of documents D.

Ranking is particularly interesting as a benchmark task for loss-sensitive training of CRFs for two reasons. The first is 
the complexity of the output space $\mathcal{Y}(q, D)$, which corresponds to all possible permutations of documents D, making the
application of CRFs to this setting more challenging than sequential labeling problems with chain structure. 

The second is that learning to rank is an example of a task with multiple ground truths (see Section 4), which is a more
challenging setting than the single ground truth case. Indeed, for each training input $x_t = (q_t, D_t)$, we are not given a
single target rank $y_t$, but a vector $r_t$ of relevance level values for each document. The higher the level, the more relevant
the document is and the better its rank should be. Moreover, two documents $d^{(i)}$ and $d^{(j)}$ with the same relevance level
(i.e., $r_{ti} = r_{tj}$) are indistinguishable in their ranking, meaning that they can be swapped within some ranking without
affecting the quality of that ranking.

The quality of a ranking is measured by the Normalized Discounted Cumulative Gain:

$$\mathrm{NDCG}(y, r_t) = N_t \sum_{i} \frac{2^{r_{ti}} - 1}{\log(1 + y_i)} \qquad (12)$$

where $N_t$ is a normalization constant, the inverse of the sum in Equation 12 evaluated at the ranking $\mathrm{argsort}(-r_t)$, which ensures that the maximum value of NDCG is 1;
this maximum is achieved when documents are ordered in decreasing order of their relevance levels. Note that this is not a standard
definition of NDCG; we use it here because this form was adopted for the evaluation of the baselines on Microsoft's
LETOR4.0 dataset collection [24]. To convert NDCG into a loss, we simply define $l_t(y) = 1 - \mathrm{NDCG}(y, r_t)$.
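
A minimal sketch of this performance measure and of the derived loss is given below. It assumes the gain term $2^{r} - 1$ and the discount $\log(1 + y_i)$ over 1-based ranks; the exact form used by the LETOR evaluation tools should be checked against those tools. The helper names and the convention for queries with no relevant document are choices made for this sketch.

```python
import math

def dcg(ranks, relevances):
    """Unnormalized sum: sum_i (2^{r_i} - 1) / log(1 + y_i), with 1-based ranks y_i."""
    return sum((2.0 ** r - 1.0) / math.log(1.0 + y) for y, r in zip(ranks, relevances))

def ideal_ranks(relevances):
    """Ranks that order the documents by decreasing relevance level."""
    order = sorted(range(len(relevances)), key=lambda i: -relevances[i])
    ranks = [0] * len(relevances)
    for position, doc in enumerate(order, start=1):
        ranks[doc] = position
    return ranks

def ndcg(ranks, relevances):
    best = dcg(ideal_ranks(relevances), relevances)
    if best == 0.0:
        # No relevant document: every ranking is equally good (sketch convention).
        return 1.0
    return dcg(ranks, relevances) / best

def ndcg_loss(ranks, relevances):
    """Loss l_t(y) = 1 - NDCG(y, r_t)."""
    return 1.0 - ndcg(ranks, relevances)

relevances = [2, 0, 1]
print(ndcg_loss([1, 3, 2], relevances))  # ideal ranking
print(ndcg_loss([3, 1, 2], relevances))  # most relevant document ranked last
```

Swapping two documents with the same relevance level leaves the loss unchanged, which is how the multiple ground truths of Section 4 arise in ranking.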

A common approach to ranking is to learn a scoring function $f(q, d^{(i)})$ which outputs for all documents $d^{(i)} \in D$ a
score corresponding to how relevant document $d^{(i)}$ is for query q. Here, we follow the same approach by incorporating
this scoring function into the energy function of the CRF. We use an energy function linear in the scores:

$$E(y, q, D) = -\sum_{i=1}^{|D|} a_{y_i} f(q, d^{(i)}) \qquad (13)$$

where a is a weight vector of decreasing values (i.e., $a_i > a_j$ for $i < j$). In our experiments, we use a weighting inspired
by the NDCG measure: $a_i = \log(2)/\log(i + 1)$. Using this energy function, we can show that the prediction $y(q, D)$ is
obtained by sorting the documents in decreasing order of their scores:

$$y(q, D) = \arg\min_{y \in \mathcal{Y}(q, D)} E(y, q, D) = \mathrm{argsort}\left( [-f(q, d^{(1)}), \ldots, -f(q, d^{(|D|)})] \right). \qquad (14)$$
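
The claim that the minimum-energy ranking is obtained by sorting can be checked by brute force on a small document set. The sketch below assumes the sign convention in which the energy is the negated weighted sum of the scores, so that placing high-scoring documents at the best ranks lowers the energy, together with the weights $a_i = \log(2)/\log(i + 1)$; the scores are hypothetical.

```python
import itertools
import math

def position_weight(rank):
    """Weight a_i = log(2) / log(i + 1) for a 1-based rank i (decreasing in i)."""
    return math.log(2.0) / math.log(rank + 1.0)

def energy(ranks, scores):
    """E(y, q, D) = -sum_i a_{y_i} * f(q, d_i), with ranks[i] the rank of document i."""
    return -sum(position_weight(y) * s for y, s in zip(ranks, scores))

def predict_by_sorting(scores):
    """Rank documents in decreasing order of their scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, doc in enumerate(order, start=1):
        ranks[doc] = position
    return tuple(ranks)

def predict_by_enumeration(scores):
    """Exhaustive argmin of the energy over all permutations of ranks."""
    n = len(scores)
    return min(itertools.permutations(range(1, n + 1)), key=lambda y: energy(y, scores))

scores = [0.3, 1.7, -0.2, 0.9]  # hypothetical scores f(q, d_i)
print(predict_by_sorting(scores), predict_by_enumeration(scores))
```

Sorting costs O(|D| log |D|), whereas the enumeration grows factorially and is shown only to confirm the equivalence on a toy case.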



^5 For example, if $y_i = 3$, then document $d^{(i)}$ is ranked third amongst all documents D for the query q.

Figure 2: NDCG@1-5 results on MQ2008 and MQ2007 datasets for different learning objectives.

As for the scoring function, we use a simple linear function $f(q, d^{(i)}) = \theta^\top \phi(q, d^{(i)})$ on a joint query-document feature
representation $\phi(q, d^{(i)})$. A standard feature representation is provided with each of the ranking datasets we considered.

We trained CRFs according to maximum likelihood as well as the different loss-sensitive objectives described in Section 3.
In all cases, stochastic gradient descent was used by iterating over queries and performing a gradient step update
for each query. Moreover, because the size of $\mathcal{Y}(q, D)$ is factorial in the number of documents, explicit summation over
that set is only tractable for a small number of documents. To avoid this problem we use an approach similar to the one
suggested by Petterson et al. [25]. Every time a query $q_t$ is visited and its associated set of documents $D_t$ contains more than
6 documents, we randomly select a subset of 6 documents $\tilde{D} \subset D_t$, ensuring that it contains at least one document of every relevance
level found for that query. The exact parameter gradients can then be computed for this reduced set by enumerating all
possible permutations, and the CRF can be updated.
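
One way to realize this subsampling step is sketched below. The procedure of first drawing one document per relevance level and then filling the remaining slots uniformly at random is an assumption of this sketch (the text only specifies the size of the subset and the coverage constraint), as are the function and argument names.

```python
import random

def sample_documents(relevances, max_docs=6, rng=random):
    """Return sorted indices of at most max_docs documents covering every relevance level.

    Assumes the number of distinct relevance levels does not exceed max_docs
    (there are three levels in the datasets considered here).
    """
    n = len(relevances)
    if n <= max_docs:
        return list(range(n))
    by_level = {}
    for i, r in enumerate(relevances):
        by_level.setdefault(r, []).append(i)
    # First, one randomly chosen document for each relevance level present.
    chosen = [rng.choice(docs) for docs in by_level.values()]
    # Then fill the remaining slots uniformly from the other documents.
    remaining = [i for i in range(n) if i not in chosen]
    chosen += rng.sample(remaining, max_docs - len(chosen))
    return sorted(chosen)

rng = random.Random(0)
relevances = [0] * 10 + [1] * 3 + [2]  # hypothetical query with 14 documents
subset = sample_documents(relevances, max_docs=6, rng=rng)
print(subset, [relevances[i] for i in subset])
```

With 6 documents the output space contains 6! = 720 permutations, which is small enough to enumerate when computing exact gradients.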

6.1 Datasets 

In our experiments we use the LETOR [24] benchmark datasets. These datasets were chosen because they are publicly
available, include several baseline results, and provide evaluation tools to ensure accurate comparison between methods.
In LETOR4.0 there are two learning-to-rank datasets, MQ2007 and MQ2008. MQ2007 contains 1692 queries with 69623
documents and MQ2008 contains 784 queries and a total of 15211 documents. Each query-document pair is assigned
one of three relevance judgments: 2 = highly relevant, 1 = relevant and 0 = irrelevant. Both datasets come with five
precomputed folds with 60/20/20 splits for training, validation and testing. The results show, for each model, the averages of
the test set results over the five folds.



6.2 Results 

We experimented with five objective functions, namely: maximum likelihood (ML), loss-augmented ML (LA), loss-scaled
ML (LS), expected loss (EL) and loss-inspired Kullback-Leibler divergence (KL). For the loss-augmented objective we
introduced an additional weight $\alpha > 0$, modifying the energy to $E_t^{\mathrm{LA}}(y, x_t; \theta) = E(y, x_t; \theta) - \alpha l_t(y)$. In this form,
$\alpha$ controls the contribution of the loss to the overall energy. For all objectives we did a sweep over learning rates in
[0.5, 0.01, 0.01, 0.001]. Moreover we experimented with $\alpha$ in [1, 10, 20, 50] for the loss-augmented objective and T in
[1, 10, 20, 50] for the KL objective. For each fold the setting that gave the best validation NDCG was chosen and the
corresponding model was then tested on the test set.

Table 1: NDCG@1-5 results on MQ2008 and MQ2007 datasets.

AdaRank  ...  40.67  ...  46.53  48.21
KL       41.06  40.90  40.93  41.33  41.75  39.47  41.80  43.74  46.18  47.84

The results for the five objective functions are shown in Figures 2(a) and 2(b). First, we see that in almost all cases
loss-augmentation produces better results than the base maximum likelihood approach. Second, loss-scaling further improves
on the loss-augmentation results and has similar performance to the expected loss objective. Finally, among all objectives,
KL consistently produces the best results on both datasets. Taken together, these results strongly support our claim that
incorporating the loss into the learning procedure of CRFs is important.

A comparison of the CRFs trained on the KL objective with other models is also shown in Table 1, where the performance
of linear regression and other linear baselines listed on LETOR's website is provided. We see that KL outperforms
the baselines on the MQ2007 dataset on all truncations except 4. Moreover, on MQ2008 the performance of KL is comparable
to the best baseline AdaRank, with KL beating AdaRank on NDCG@1. We note also that KL consistently outperforms
LETOR's SVM-Struct baseline.

7 Conclusion 

In this work, we explored different approaches to incorporating loss function information into the training objective of a 
probabilistic CRF. We discussed how to adapt ideas from the SSVM literature to the probabilistic context of CRFs, introducing
for the first time the equivalent of slack scaling to CRFs. We also described objectives that rely on the probabilistic
nature of CRFs, including a novel loss-inspired KL objective. In an empirical comparison on ranking benchmarks, this 
new KL objective was shown to consistently outperform all other loss-sensitive objectives. 

To our knowledge, this is the broadest comparison of loss-sensitive training objectives for probabilistic CRFs yet to be 
made. It strongly suggests that the most popular approach to CRF training, maximum likelihood, is likely to be suboptimal. 
While ranking was considered as the benchmark task here, in future work, we would like to extend our empirical analysis 
to other tasks such as labeling tasks.